Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
IMO: the takeaway is that a strong LLM such as GPT-4 can act as an LLM-as-a-judge whose agreement with human evaluation reaches the same level as agreement between humans.
From the Abstract: on evaluating LLM-based chat assistants
we explore using strong LLMs as judges to evaluate these models on more open-ended questions
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: #MT-bench , a multi-turn question set; and #Chatbot_Arena , a crowdsourced battle platform. (Abstract)
The agreement is evaluated in Section 4.
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
we study the LLM-as-a-judge approach by comparing it to the gold standard of human evaluation (1 Introduction)
Table 1
We create MT-bench, a benchmark consisting of 80 high-quality multi-turn questions.
Chatbot Arena, a crowdsourcing benchmark platform featuring anonymous battles.
users can interact with two anonymous models simultaneously, posing the same question to both.
Users then vote for which response is better.
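To keep the Arena data format straight, here is a minimal sketch of what one battle record could look like; the field names are my own assumption, not the paper's released schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical shape of one Chatbot Arena battle record
# (field names are an assumption, not the actual schema).
@dataclass
class Battle:
    question: str                    # same prompt sent to both anonymous models
    model_a: str                     # identities hidden until after the vote
    model_b: str
    answer_a: str
    answer_b: str
    vote: Literal["A", "B", "tie"]   # the user's preference
```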
Section 3 is still on my to-read pile (looks interesting).
4 Agreement Evaluation
We randomly sample 3K single-turn votes from 30K arena data (4.1)
Figure 3
Bias toward rating models from the same family more highly (e.g., the Claude plot in (b)).
We propose 3 LLM-as-a-judge variations (3.1)
Prompt templates are in Appendix A.
Pairwise comparison(Figure 5)
Single answer grading(Figure 6)
Reference-guided grading(Figure 8)
(The figures run up to Figure 10.)
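A minimal sketch of the pairwise-comparison variation, just to make the flow concrete: the prompt below only paraphrases the idea of the Appendix A template (it is not the actual template), and `call_llm` is a placeholder for whatever chat-completion client is used.

```python
# Minimal sketch of pairwise-comparison judging; the prompt is a paraphrase,
# not the exact Appendix A template, and call_llm() is a placeholder client.
JUDGE_PROMPT = """[System]
You are an impartial judge. Compare the two assistant answers to the user
question below and decide which is better. After a short explanation, output
"[[A]]", "[[B]]", or "[[C]]" for a tie.

[Question]
{question}

[Assistant A's answer]
{answer_a}

[Assistant B's answer]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Return 'A', 'B', or 'tie' according to the LLM judge's verdict."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    if "[[A]]" in reply:
        return "A"
    if "[[B]]" in reply:
        return "B"
    return "tie"
```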
We present a few methods to address position bias and the limited grading ability for math questions (3.4)
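One of the Section 3.4 mitigations is swapping the two answers' positions; my reading is that a verdict is only kept when it is consistent across both orders, and is otherwise treated as a tie. A sketch reusing the hypothetical `judge_pair` above:

```python
def judge_pair_debiased(question, answer_a, answer_b, call_llm):
    """Judge twice with the answer order swapped; keep only a consistent verdict.

    A conservative reading of the position-bias fix in Sec. 3.4: if the two
    orderings disagree, the result is treated as a tie.
    """
    first = judge_pair(question, answer_a, answer_b, call_llm)
    second = judge_pair(question, answer_b, answer_a, call_llm)
    # In the swapped call, a vote for "A" actually refers to answer_b.
    second = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second else "tie"
```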
Biases are reported in Appendix B (Case Study).
Appendix D Additional Experimental Results
Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.
According to Appendix (D.3?), tie votes are apparently excluded (per Sekine-san).
"If humans split 50% for A and 50% for B, and GPT-4 picks A, that counts as 50% agreement" (human-majority).
For example, if there are an equal number of “A” and “B” human votes for a question, and GPT-4 votes “A”, the agreement is counted as 1/2 on this question.
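A small sketch of how that per-question agreement could be computed (my reading of the example above, ignoring the tie-exclusion setting; not the paper's code):

```python
from collections import Counter

def question_agreement(gpt4_vote: str, human_votes: list[str]) -> float:
    """Agreement of the GPT-4 verdict with the pool of human votes on one question.

    Matches the example above: with human votes split 50% "A" / 50% "B" and a
    GPT-4 vote of "A", this returns 0.5.
    """
    counts = Counter(human_votes)
    return counts[gpt4_vote] / len(human_votes)

def overall_agreement(records) -> float:
    """Average per-question agreement over (gpt4_vote, human_votes) pairs."""
    scores = [question_agreement(g, h) for g, h in records]
    return sum(scores) / len(scores)

# Example: humans split evenly, GPT-4 says "A" -> 0.5 agreement on that question.
assert question_agreement("A", ["A", "B"]) == 0.5
```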